llama-index struggle log

Goal: express the relationships among the Bocchi the Rock! members in Mermaid notation

Points that look tricky

The step that queries the embeddings of the added data and the data source needed to write Mermaid notation are far apart?

Context gets lost when the text is split into chunks
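To make the chunk problem concrete, here is a minimal sketch of fixed-size chunking with and without overlap. `chunk_text` and the sample sentence are made up for illustration; this is not llama-index's own splitter, which as far as I understand counts tokens rather than characters.

```python
def chunk_text(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks, optionally overlapping."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Hypothetical sample text with placeholder names.
text = "Alice is the guitarist. She formed the band together with Bob the drummer."

# Without overlap, the name "Bob" straddles a chunk boundary and the chunk
# that starts mid-sentence no longer says who "She" refers to.
plain = chunk_text(text, size=30)

# With overlap, neighbouring chunks share 15 characters, so a name cut at
# one boundary still appears whole in another chunk.
overlapped = chunk_text(text, size=30, overlap=15)

for chunk in plain:
    print(repr(chunk))
```

Overlap only softens the problem: a relationship spread over several sentences can still end up in different chunks.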

Why I think it is probably doable

https://twitter.com/mikito0521/status/1635752303155376129

The bare code looks roughly like this

code:gpt-handler.py

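# Assumes the llama_index 0.4-era API; `docs` and `query` are defined elsewhere.
from langchain.llms import OpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor

# Start from an empty vector index and insert the documents one by one.
index = GPTSimpleVectorIndex([])
for doc in docs:
    index.insert(doc)

# temperature=0 keeps the answers as deterministic as possible.
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="gpt-3.5-turbo"))
index.query(
    query,
    llm_predictor=llm_predictor,
)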

The data source is Wikipedia

Idea 1: have it learn how to write Mermaid as well

When building the index, have SimpleWebPageReader read pages about Mermaid too

https://gpt-index.readthedocs.io/en/latest/reference/readers.html#gpt_index.readers.SimpleWebPageReader

It just answered that it knows nothing about Mermaid, and that was the end of it

Roughly like this

code:gpt-handler.py

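# Assumes the llama_index 0.4-era API; `docs` and `query` are defined elsewhere.
from langchain.llms import OpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, SimpleWebPageReader

index = GPTSimpleVectorIndex([])
for doc in docs:
    index.insert(doc)

# Also load pages that explain Mermaid syntax into the same index.
mermaid_docs = SimpleWebPageReader().load_data([
    "https://notepm.jp/help/mermaid",
    "https://mermaid.js.org/intro/",
    "https://mermaid.js.org/config/Tutorials.html",
])
for doc in mermaid_docs:
    index.insert(doc)

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="gpt-3.5-turbo"))
index.query(
    query,
    llm_predictor=llm_predictor,
)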

Idea 2: use the Tree Index

Skimming https://note.com/__ramu0e__/n/ndb9a1d1cab49, it looked well suited to summarization

In the official docs:

https://gpt-index.readthedocs.io/en/latest/guides/index_guide.html#tree-index

https://gpt-index.readthedocs.io/en/latest/reference/indices/tree.html

This is harder than expected

The text is too large to be split cleanly, so indexing fails

code:log

ValueError: A single term is larger than the allowed chunk size.

Term size: 477

Chunk size: 339Effective chunk size: 339

https://github.com/jerryjliu/llama_index/issues/453
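One possible workaround, sketched under an assumption: the error seems to come from text with few or no spaces (such as Japanese) producing a single "term" longer than the chunk size. The helper `presplit` below is hypothetical and not part of llama-index. It measures length in characters rather than tokens, so `max_len` would need a safety margin.

```python
import re


def presplit(text: str, max_len: int) -> list[str]:
    """Break text into pieces of at most max_len characters.

    Splits after sentence-ending punctuation first, then hard-wraps any
    sentence that is still too long.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    sentences = [s for s in re.split(r"(?<=[。．.!?！？])", text) if s]
    pieces = []
    for sentence in sentences:
        for start in range(0, len(sentence), max_len):
            pieces.append(sentence[start:start + max_len])
    return pieces


# Hypothetical sample: one long and one short "sentence" without any spaces.
sample = "あ" * 50 + "。" + "い" * 10 + "。"
pieces = presplit(sample, max_len=20)

# Joining with spaces gives a whitespace-based splitter something to cut on.
spaced = " ".join(pieces)
print(len(pieces), max(len(p) for p in pieces))
```

The spaced text could then be wrapped in a document and passed to the index as before. Whether this actually avoids the error depends on how the splitter counts tokens.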

code:python

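# Assumes the llama_index 0.4-era API; `documents` and `query` are defined elsewhere.
from llama_index import GPTTreeIndex
from llama_index.prompts.prompts import SummaryPrompt

# `query` is baked into the template by the f-string; {context_str} is left
# as a placeholder for the index to fill in while it builds summaries.
template = (
    # Japanese: "The following is not background knowledge but a question."
    "以下は予備知識ではなく、質問です。 "
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    f"answer the question: {query}\n"
)
prompt = SummaryPrompt(template)
index_with_query = GPTTreeIndex(documents, summary_template=prompt)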

The downside is that this produces an index specialized for one query, so it is not reusable

If an index has to be built for every query, the cost becomes a bit of a concern

Maybe build a separate index per source first (for example one just for Wikipedia), keep them apart, and combine them afterwards?

Idea 3: output only the relationships, then have plain ChatGPT read them and convert them

When I tried this, the relationship output itself was not good to begin with

Is named entity extraction failing?

That seems plausible. ChatGPT 3.5 in particular cannot do things like calculation reliably.
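If the relationships did come out in a structured form, the conversion step of Idea 3 would not need an LLM at all. A minimal sketch, assuming the relationships can be obtained as (source, label, target) triples; `to_mermaid` and the sample names are hypothetical placeholders.

```python
def to_mermaid(relations: list[tuple[str, str, str]]) -> str:
    """Render (source, label, target) triples as a Mermaid flowchart."""
    ids: dict[str, str] = {}

    def node_id(name: str) -> str:
        # Use generated ids (n0, n1, ...) so any display name is safe.
        if name not in ids:
            ids[name] = f"n{len(ids)}"
        return ids[name]

    def quote(text: str) -> str:
        # A raw double quote would end the quoted label early, so replace
        # it with Mermaid's entity code.
        return text.replace('"', "#quot;")

    lines = ["graph LR"]
    for source, label, target in relations:
        s, t = node_id(source), node_id(target)
        lines.append(
            f'    {s}["{quote(source)}"] -->|"{quote(label)}"| {t}["{quote(target)}"]'
        )
    return "\n".join(lines)


# Placeholder data, not taken from the actual Wikipedia source.
relations = [
    ("Alice", "bandmate", "Bob"),
    ("Bob", "sister of", "Carol"),
]
diagram = to_mermaid(relations)
print(diagram)
```

This keeps the LLM responsible only for extraction, which is the part that was failing here, and makes the Mermaid output deterministic.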

Idea 4: try changing parameters such as temperature

A higher temperature should make the output more creative?

But it turns out the default is 0.7, which already looks fairly high
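A rough sketch of what temperature is generally understood to do: the logits are divided by the temperature before the softmax, so a low value sharpens the distribution and a high value flattens it. This is a textbook illustration with made-up logits, not OpenAI's actual sampling code.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 0.7)

# The top candidate takes a larger share of the probability at the lower temperature.
print(round(cold[0], 3), round(warm[0], 3))
```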

Idea 5: use gpt-3 not only for the predictor but also at embedding time?

Since starting this, it feels like things got a little better?

---

After trying various things, the approach based on Idea 3 worked

https://scrapbox.io/files/6415a2beaaa070001c3e8e25.png

But no sooner had I thought that than I noticed GPTTreeIndex was logging

INFO:root:I'm sorry, but it is not possible to answer this question as the provided context information does not indicate any information about Kanata Hoshijima's abilities, education, or hobbies.

so it was actually ChatGPT that had been producing the nice-looking answer

Idea 6: try KeywordTreeIndex

Tried it, but it takes a very long time. It blows well past the 5-minute limit on Lambda, so I passed on it.

https://community.openai.com/t/splitting-chunking-large-input-text-for-summarisation-greater-than-4096-tokens/18494